fix: surface test-stdout.txt in verifier exception for dep_install classification (#540) by ElegantLin · Pull Request #572 · benchflow-ai/benchflow

ElegantLin · 2026-05-27T08:31:18Z

Summary

Root cause: RewardFileNotFoundError in verifier.py only included file paths in its message, never pip's actual error output. The classify_verifier_error function had dep_install markers ("no solution found", etc.) but could never match them because they lived in test-stdout.txt, not the exception.
Fix: When raising RewardFileNotFoundError, tail the last 30 lines of test-stdout.txt into the exception message. This lets classify_verifier_error see pip's output and correctly return verifier_dep_install.
No changes to scoring.py — the classifier logic and markers were already correct (added in prior PRs); only the wiring was broken.

Test plan

4 new parametrized cases in test_classify_verifier_error covering all dep_install markers in exception messages
test_dep_install_takes_precedence_over_generic_crash — precedence test
TestRewardFileNotFoundSurfacesStdout — 4 end-to-end wiring tests: stdout tail present, classifier fires, missing file graceful, non-dep-install stays verifier_failure
All 96 verifier+scoring tests pass
ruff clean

Closes #540

🤖 Generated with Claude Code

…l classifier fires (#540) The verifier_error_category=dep_install classification could never fire in production because RewardFileNotFoundError only contained file paths, not pip's actual error output. Now the exception tails test-stdout.txt (last 30 lines) into the message, letting classify_verifier_error see markers like "no solution found" from real pip failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devin-ai-integration

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

devin-ai-integration · 2026-05-27T08:33:07Z

+                    f"{self._rollout_paths.reward_json_path}"
+                )
+            if stdout_tail:
+                msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}"


🟡 Hardcoded "30" in error message drifts from _TAIL_LINES constant

The error message at src/benchflow/task/verifier.py:317 uses a hardcoded string "last 30 lines" instead of referencing the _TAIL_LINES constant defined at src/benchflow/task/verifier.py:63. If _TAIL_LINES is ever changed (e.g. to 50), _tail_file will return a different number of lines but the error message will still claim "last 30 lines", misleading anyone triaging verifier failures.

Suggested change

msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}"

msg += f"\n--- test-stdout.txt (last {_TAIL_LINES} lines) ---\n{stdout_tail}"

Was this helpful? React with 👍 or 👎 to provide feedback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bingran-you

Thermo-nuclear review verdict: request changes.

The dependency-install classification improvement is useful, and the real Claude ACP smoke passed, but the current implementation surfaces raw verifier stdout through RewardFileNotFoundError / verifier_error. That widens the secret-leak blast radius because test-stdout.txt can contain env dumps, install URLs, tokens, or other untrusted subprocess output that later lands in summaries/dashboards.

Please redact before appending stdout tails, or keep raw stdout out of result metadata and classify from a private diagnostic path instead.

Additional maintainability issues:

_tail_file() reads/splits the whole file to return 30 lines. Please stream-tail it, e.g. with collections.deque(f, maxlen=n).
tests/test_verify.py grows past 1k lines. Please move these regression tests to a focused verifier-output module.
Regression test docstrings should name the reviewed PR/fix explicitly, e.g. PR #572 / #540, per repo convention.

Verification notes:

Branch-head checks passed: uv sync, full pytest, ty, ruff.
Current-merge pytest and ty passed.
Current-merge ruff failed only on existing main baseline UP041 in src/benchflow/sandbox/docker.py:408, not this PR.
Real claude-agent-acp Docker smoke passed: reward 1.0, trajectory_source=acp, 5 ACP events, 1 tool call, usage provider_response, total tokens 64349, llm_trajectory.jsonl present.

devin-ai-integration Bot reviewed May 27, 2026

View reviewed changes

fix: use _TAIL_LINES constant in error message instead of hardcoded 30

fe4ebe8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bingran-you requested changes May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572

fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572
ElegantLin wants to merge 2 commits into
mainfrom
fix/540-verifier-dep-install

ElegantLin commented May 27, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 27, 2026

Uh oh!

bingran-you left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}"
	msg += f"\n--- test-stdout.txt (last {_TAIL_LINES} lines) ---\n{stdout_tail}"

Conversation

ElegantLin commented May 27, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

bingran-you left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ElegantLin commented May 27, 2026 •

edited by devin-ai-integration Bot

Loading